The imposition of real-time constraints on a parallel computing environment (specifically, high-performance
cluster-computing systems) introduces a variety of challenges with respect to the formal
verification of the system's timing properties. In this paper, we briefly motivate the need for such
a system, and we introduce an automaton-based method for performing such formal verification. We 
define the concept of a consistent parallel timing system: a hybrid system consisting of a set of timed 
automata (specifically, timed Büchi automata as well as a timed variant of standard finite automata),
intended to model the timing properties of a well-behaved real-time parallel system. Finally, we give 
a brief case study to demonstrate the concepts in the paper: a parallel matrix multiplication kernel 
which operates within provable upper time bounds. We give the algorithm used, a corresponding 
consistent parallel timing system, and empirical results showing that the system operates under the 
specified timing constraints. 

1 Introduction 

Real-time computing has traditionally been considered largely in the context of single-processor and 
embedded systems, and indeed, the terms real-time computing, embedded systems, and control systems 
are often mentioned in closely related contexts. However, real-time computing in the context of multin- 
ode systems, specifically high-performance, cluster-computing systems, remains relatively unexplored. 
It can be argued that one reason for the relative dearth of work in this area is the lack of scenarios to date 
which would require such a system. Previously [11, 12], we have motivated the emerging need for such
an infrastructure, giving a specific scenario related to the next generation North American electrical grid. 
In that work, we described the changes and challenges in the power grid driving the need for much higher 
levels of computational resources for power grid operations. To briefly summarize (and to provide some 
motivational context for the current work), many of these computations — particularly floating-point in- 
tensive simulations and optimization calculations (El |3] HO [U 10]) — can be more effectively done in a 
centralized manner, and the amount and scale of such data is estimated by some [11, 12] to be on the
order of terabytes per day of streaming sensor data (e.g. Phasor Measurement Units (PMUs)), with the 
need to analyze the data within a strict cyclical window (every 30ms), presumably with the aid of high- 
performance, parallel computing infrastructures. With this in mind, the current work is part of a larger 
research effort at Pacific Northwest National Laboratory aimed at developing the necessary infrastruc- 
ture to support an HPC cluster environment capable of processing vast amounts of streaming sensor data 
under hard real-time constraints. 

While verifying the timing properties of a more traditional (e.g. embedded) real-time system poses 
complex questions in its own right, imposing real-time constraints on a parallel (cluster) computing 
environment introduces an entirely new set of challenges not seen in these more traditional environments. 
For example, in addition to standard real-time concepts such as worst-case execution time (WCET), real- 
time parallel computation introduces the necessity of considering worst case transmission time when 
communicating over the network between nodes, as well as the need to ensure that timing properties of
one process do not invalidate those of the entire parallel process as a whole. 

These are but two examples of the many questions which must be addressed in a real-time parallel 
computing system; certainly there are many more questions than can be addressed in a single paper. To 
this end, we introduce a simple, event driven, automata-based model of computation intended to model 
the timing properties of a specific class of parallel programs. Namely, we consider SPMD (Single Program,
Multiple Data), parent-child type programs, in part because in practice, many parallel programs,
including many prototypical MPI-based [13, 15] programs, fall into this category. We give an example
of such a program in Section 3. This model is typified by the existence of a cyclic master or parent
process, and a set of noncyclic child or slave processes amongst which work is divided. With this charac- 
terization, a very natural correspondence emerges between the processes and the automata which model 
them: the cyclic parent process is very naturally modeled by an ω-automaton, and the child processes by
a standard finite automaton. Our main contribution of this paper, then, is twofold: first, a formal method 
of modeling the respective processes in this manner, combining these into a single hybrid system of par- 
allel automata, and secondly, a simple case study demonstrating a practical application of this system. 
We should note that the notion of parallel finite automata is not a new one; variants have been studied 
before (e.g. [6, 14]). We take the novel approach of combining timed variants ([1, 3]) of finite automata
into a single hybrid model which captures the timing properties of the various component processes of a 
parallel system. 

The rest of the paper proceeds as follows: Section 2 defines the automaton models used by our
system: Timed Finite Automata in Section 2.1, Timed Büchi Automata in Section 2.2, and a hybrid
system combining these two models in Section 2.3. Section 3 gives a case study in the form of an
example real-time matrix multiplication kernel, running on a small, four-node real-time parallel cluster.
Section 4 concludes.

2 Formalisms 

In this section, we give formal definitions for the machinery used in our hybrid system of automata. The 
definitions given in Sections 2.1 and 2.2 are not new [1]. However, it is still important that we state their
definitions here, as they are used later on, in Section 2.3.

2.1 Timed Finite Automata 

In this section, we define a simple timed extension of traditional finite state automata and the words they 
accept. We will use these in later sections to model the timing properties of child processes in a real-time 
cluster system. 

Timed strings take the form $(\sigma, \tau)$, where $\sigma$ is a string of symbols, and $\tau$ is a monotonically
increasing sequence of reals (timestamps). $\tau_x$ denotes the timestamp at which symbol $\sigma_x$ occurs. We also
use the notation $(\sigma_x, \tau_x)$ to denote a particular symbol/timestamp pair. For instance, the timed string
$((abc), (1, 10, 11))$ is equivalent to the sequence $(a, 1)(b, 10)(c, 11)$, and both represent the case where
'a' occurs at time 1, 'b' at time 10, and 'c' at time 11.
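The equivalence between the two notations can be illustrated with a minimal Python sketch. The function name `to_pairs` is hypothetical, and the sketch assumes that "monotonically increasing" means strictly increasing timestamps.

```python
from typing import List, Tuple

def to_pairs(symbols: str, stamps: List[float]) -> List[Tuple[str, float]]:
    """Convert a timed string (sigma, tau) into (symbol, timestamp) pairs."""
    if len(symbols) != len(stamps):
        raise ValueError("one timestamp is required per symbol")
    # Assumption: timestamps must be strictly increasing.
    if any(later <= earlier for earlier, later in zip(stamps, stamps[1:])):
        raise ValueError("timestamps must be strictly increasing")
    return list(zip(symbols, stamps))

# The timed string ((abc), (1, 10, 11)) from the text.
example = to_pairs("abc", [1, 10, 11])
```

Here `example` holds the pair sequence (a, 1)(b, 10)(c, 11) described above.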

Correspondingly, we extend traditional finite automata to include a set of timers, which impose tem- 
poral restrictions along state transitions. A timer can be initialized along a transition, setting its value 
to when the transition is taken, and it can be used along a transition, indicating that the transition can 
only be taken if the value of the timer satisfies the specified constraint. Formally, we associate with each 
automaton a set of timer variables $\mathcal{T}$, and following the nomenclature of [1], an interpretation $\nu$ for this
set of timers is an assignment of a real value to each of the timers in $\mathcal{T}$. We write $\nu[T \mapsto 0]$ to denote the
interpretation $\nu$ with the value of timer $T$ reset to 0. Clock constraints consist of conjunctions of upper
bounds:

Definition 1. For a set $\mathcal{T}$ of clock variables, the set $X(\mathcal{T})$ of clock constraints $\chi$ is defined inductively as

$$\chi := (T < c) \mid \chi_1 \wedge \chi_2$$

where $T$ is a clock in $\mathcal{T}$ and $c$ is a constant in $\mathbb{R}^+$.

While this definition may seem overly restrictive compared to some other treatments (e.g. [1]), we
believe it to be acceptable in this early work for a couple of reasons. First, while simple, this sole syntactic
form remains expressive enough to capture an interesting, non-trivial set of use cases (e.g. Section 3).
Secondly, the timing analysis in subsequent sections of the paper becomes rather complex, even when 
timers are limited to this single form. Restricting the syntax in this manner simplifies this analysis to a 
more manageable level. We leave more complex formulations and the corresponding analysis for future 
work. 

Definition 2 (Timed Finite Automaton (TFA)). A Timed Finite Automaton (TFA) is a tuple
$\langle \Sigma, Q, s, q_f, \mathcal{T}, \delta, \gamma, \eta \rangle$, where

• $\Sigma$ is a finite alphabet,

• $Q$ is a finite set of states,

• $s \in Q$ is the start state,

• $q_f \in Q$ is the accepting state,

• $\mathcal{T}$ is a set of clocks,

• $\delta \subseteq Q \times Q \times \Sigma$ is the state transition relation,

• $\gamma \subseteq \delta \times 2^{\mathcal{T}}$ is the clock initialization relation, and

• $\eta \subseteq \delta \times X(\mathcal{T})$ is the constraint relation.

A tuple $(q_i, q_j, \sigma) \in \delta$ indicates that a symbol $\sigma$ yields a transition from state $q_i$ to state $q_j$, subject to the
restrictions specified by the timer constraints in $\eta$. A tuple $(q_i, q_j, \sigma, \{T_1, \ldots, T_n\}) \in \gamma$ indicates that on the
transition on symbol $\sigma$ from $q_i$ to $q_j$, all of the specified timers are to be initialized to 0. Finally, a tuple
$(q_i, q_j, \sigma, \chi(\mathcal{T})) \in \eta$ indicates that the transition on $\sigma$ from $q_i$ to $q_j$ can only be taken if the constraint
$\chi(\mathcal{T})$ evaluates to true under the current timer interpretation.

Example 3. The following TFA accepts the timed language $\{(ab^*c, \tau_1 \ldots \tau_n) \mid \tau_n - \tau_1 < 10\}$ (i.e., the set
of all strings consisting of an 'a', followed by an arbitrary number of 'b's, followed by a 'c', such that
the elapsed time between the first and last symbols is no greater than 10 time units). The start state is
denoted with a dashed circle, and the accepting state with a double line.

Paths and runs are defined in the standard way:

Definition 4 (Path). Let $A$ be a TFA with state set $Q$ and transition relation $\delta$. Then $(q_1, \ldots, q_n)$ is a path
over $A$ if, for all $1 \le i < n$, $\exists \sigma.\ (q_i, q_{i+1}, \sigma) \in \delta$.

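Definition 5 (Run). A run $r$ of a TFA $\langle \Sigma, Q, q_0, q_f, \mathcal{T}, \delta, \gamma, \eta \rangle$ over a timed word $(\sigma, \tau)$ is a sequence of
the form

$$r : (q_0, \nu_0) \xrightarrow[\tau_1]{\sigma_1} (q_1, \nu_1) \xrightarrow[\tau_2]{\sigma_2} (q_2, \nu_2) \xrightarrow[\tau_3]{\sigma_3} \cdots \xrightarrow[\tau_n]{\sigma_n} (q_n, \nu_n)$$

satisfying the following requirements:

• Initialization: $\nu_0(k) = 0,\ \forall k \in \mathcal{T}$

• Consecution: For all $i > 0$:

- $\delta \ni (q_{i-1}, q_i, \sigma_i)$,

- $(\nu_{i-1} + \tau_i - \tau_{i-1})$ satisfies $\chi_i$, where $\eta \ni (q_{i-1}, q_i, \sigma_i, \chi_i)$, and

- $\nu_i = (\nu_{i-1} + \tau_i - \tau_{i-1})[T \mapsto 0],\ \forall T \in \bar{T}$, where $\gamma \ni (q_{i-1}, q_i, \sigma_i, \bar{T})$

$r$ is an accepting run if $q_n = q_f$.

A TFA $A$ accepts a timed string $s = (\sigma, \tau_1 \ldots \tau_n)$ if there is an accepting run of $s$ over $A$, and $\tau_n - \tau_1$
is called the duration of the string.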
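The run conditions above can be illustrated with a minimal Python sketch that simulates the automaton described in Example 3. The state names `q1`, `q2`, `q3`, the dictionary layout, and the function `accepts` are assumptions made for illustration, and the sketch assumes a deterministic automaton with at most one edge per state and symbol.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source state, target state, symbol)

# Example 3's automaton as described in the text; state names are assumed.
DELTA: Set[Edge] = {("q1", "q2", "a"), ("q2", "q2", "b"), ("q2", "q3", "c")}
GAMMA: Dict[Edge, Set[str]] = {("q1", "q2", "a"): {"T"}}        # timers reset on an edge
ETA: Dict[Edge, List[Tuple[str, float]]] = {("q2", "q3", "c"): [("T", 10.0)]}  # T < 10
START, ACCEPT = "q1", "q3"

def accepts(symbols: str, stamps: List[float]) -> bool:
    """Return True if the timed string (symbols, stamps) has an accepting run."""
    state = START
    timers = {"T": 0.0}                      # initialization: every timer is 0
    previous = stamps[0] if stamps else 0.0
    for symbol, now in zip(symbols, stamps):
        edges = [e for e in DELTA if e[0] == state and e[2] == symbol]
        if not edges:
            return False
        edge = edges[0]                      # assumes determinism
        elapsed = now - previous
        # Advance every timer, then check the constraint before any reset.
        timers = {name: value + elapsed for name, value in timers.items()}
        if any(not timers[name] < bound for name, bound in ETA.get(edge, [])):
            return False
        for name in GAMMA.get(edge, set()):
            timers[name] = 0.0
        state, previous = edge[1], now
    return state == ACCEPT
```

For instance, `accepts("abbc", [1, 3, 5, 10])` holds because the 'c' arrives 9 time units after the 'a', whereas `accepts("abc", [1, 10, 11])` fails because the elapsed time reaches 10.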

Note (Well-Formedness). We introduce a restriction on how timers can be used in a TFA, thus defining 
what it means for a TFA to be well-formed. Namely, we restrict timers to be used only once along a 
path; this is to simplify somewhat the timing analysis in subsequent sections. In particular, we say that a 
TFA $A$ is well-formed if, for all pairs of states $(q_x, q_y)$, all timers $T$, and all paths from $q_x$ to $q_y$, $T$ is used
no more than once. For example, the TFAs shown in Figure 1 are not well-formed, since in both cases,
timers can potentially be used more than once: in the first case ($A_1$), along the self-loop on $q_1$, and
in the second case ($A_2$), along two separate transitions along the path. At first, this may appear to be
overly restrictive, but as it turns out, many of these cases can easily be rewritten equivalently to conform
to the single-use restriction, as shown in Figure 2.

[Figure 1 edge labels. $A_1$: a; T = 0 and b; (T < 10)?. $A_2$: a; (T = 0), b; (T < 10)?, and c; (T < 20)?.]

Figure 1: Malformed TFAs. Start states are denoted with a dashed circle, and accepting states with a
double line. The intent of $A_1$ is to allow strings of the form a, followed by arbitrarily many bs, as long
as they all occur less than 10 units after the a, followed by a c. The intent of $A_2$ is to allow strings of
the form abc, where the elapsed time between the a and b is less than 10, and that between the a and c is
less than 20. Both of these can be rewritten using conforming automata, as shown in Figure 2.

[Figure 2 edge labels. $A'_1$: a; T = 0 and (T < 10)? c. $A'_2$: $T_1 = 0$, $T_2 = 0$, $(T_1 < 10)?$, and $(T_2 < 20)?$.]

Figure 2: Equivalent, well-formed versions of the automata from Figure 1.



2.1.1 Bounding Maximum Delay 

An important notion throughout the remainder of the paper is that of computing bounds on the allowable 
delays along all possible paths through a TFA. Specifically, we are interested in doing so to be able to 
reason formally about the maximum execution time for a child process, with the end goal of being able 
to bound the execution time of the system — parent and all child processes — as a whole. 

The idea is that we will ultimately use TFAs to represent the timing properties of a child process. 
Paths through the automaton from its start state to an accepting state correspond to possible execution 
paths of the child process' code. Certainly, proving a tight upper bound on the delay between two 
arbitrary points along an execution path remains a very difficult problem, but to be clear, this is not 
our goal. Rather, our approach involves modeling an execution path through a child process (and, by 
extension, its corresponding timed automaton) using an event-based model, in which selected system 
events are modeled by transitions in the automaton, and we rely on timing properties of the process to be 
guaranteed by the underlying RTOS process scheduler. The problem of computing the worst-case delay 
through the automaton equates to that of computing the maximum delay over all possible paths through 
the automaton from its start state to its accepting state: 

$$\Delta_A = \max_{p \in \mathrm{paths}(A)} \Delta(p)$$

where

• $A = \langle \Sigma, Q, q_0, q_f, \mathcal{T}, \delta, \gamma, \eta \rangle$ is the TFA,

• $\mathrm{paths}(A)$ denotes the set of all paths in $A$ from its start state $q_0$ to accepting state $q_f$, and

• $\Delta(p)$, for path $p = (q_0, \ldots, q_f)$, denotes the maximum delay from $q_0$ to $q_f$. That is, the maximum
duration of any timed string $(\sigma, \tau)$ such that $(q_0 \ldots q_f, \nu)$ is a run of the string over $A$ (for some $\nu$).

This problem can thus be formulated in the following manner: given a timed finite automaton A and 
an integer n, is there a timed word of duration d > n that is accepted by A? While simple cases, such as 
those presented in this paper, can be computed by observation and enumeration, the complexity of the 
general problem remains an open question, although we highly suspect it to be intractable: Courcoubetis
and Yannakakis give exponential-time algorithms for this and related problems, and have shown a
strictly more difficult variant of the problem to be PSPACE-complete [4]. Furthermore, expanding the
timer constraint syntax to a more expressive variant (cf. [1]) can only complicate matters in terms of
complexity. We must be cautious, then, to ensure that we do not impose an inordinately large number of 
timers on a child process. 
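For a single, fixed path, the "observation and enumeration" step can be mechanized. The following Python sketch is one possible approach, not the method used in this paper: it treats each timer use as a difference constraint between two event timestamps and finds the least upper bound on the duration with Bellman-Ford relaxation. The names `path_delay_bound` and `automaton_delay_bound`, and the encoding of a path as event indices, are assumptions. Because the constraints are strict, the returned value is a supremum that is approached but not attained.

```python
from typing import List, Tuple

INF = float("inf")
# (reset_index, check_index, bound): the timer is reset on event reset_index
# and checked against `bound` on event check_index (indices along the path).
TimerUse = Tuple[int, int, float]

def path_delay_bound(num_events: int, timers: List[TimerUse]) -> float:
    """Least upper bound on (last timestamp - first timestamp) for one path."""
    # t[check] - t[reset] <= bound   becomes an edge reset -> check.
    # t[k] - t[k+1] <= 0 (monotone)  becomes an edge k+1 -> k of weight 0.
    edges = list(timers) + [(k + 1, k, 0.0) for k in range(num_events - 1)]
    dist = [INF] * num_events
    dist[0] = 0.0
    for _ in range(num_events - 1):          # Bellman-Ford relaxation passes
        for src, dst, weight in edges:
            if dist[src] + weight < dist[dst]:
                dist[dst] = dist[src] + weight
    return dist[num_events - 1]

def automaton_delay_bound(paths: List[Tuple[int, List[TimerUse]]]) -> float:
    """Maximum of the per-path bounds over an explicitly enumerated path set."""
    return max(path_delay_bound(n, uses) for n, uses in paths)

# Intent of A_2 in Figure 1: events a, b, c with b - a < 10 and c - a < 20.
bound_a2 = path_delay_bound(3, [(0, 1, 10.0), (0, 2, 20.0)])
```

A path with no timer spanning some gap yields an infinite bound, which matches the intuition that an unconstrained transition may be delayed arbitrarily.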

2.2 Timed Büchi Automata

Whereas we model the timing properties of the child processes of a cluster system using the timed
finite automata of the previous section, we model these properties of the parent using a timed variant
of ω-automata, specifically Timed Büchi Automata. We assume a basic familiarity with these and, due
to space constraints, give only a brief overview here. Like standard finite automata, ω-automata
consist of a finite number of states, but they operate over words of infinite length.
Classes of ω-automata are distinguished by their acceptance criteria. Büchi automata, which we consider
in this paper, are defined to accept their input if and only if a run over the input string visits an accepting
state infinitely often. Other classes of ω-automata exist as well. For example, Muller automata are more
stringent, specifying their acceptance criteria as a set of acceptance sets; a Muller automaton accepts its
input if and only if the set of states visited infinitely often is specified as an acceptance set. More detailed
specifics can be found elsewhere (for example, [1]).

A Timed Büchi Automaton (TBA) is a tuple $\langle \Sigma, Q, q_0, F, \mathcal{T}, \delta, \gamma, \eta \rangle$, where

• $\Sigma$ is a finite alphabet,

• $Q$ is a finite set of states,

• $q_0 \in Q$ is the start state,

• $F \subseteq Q$ is a set of accepting states,

• $\mathcal{T}$ is a set of clocks,

• $\delta \subseteq Q \times Q \times \Sigma$ is the state transition relation,

• $\gamma \subseteq \delta \times 2^{\mathcal{T}}$ is the clock initialization relation, and

• $\eta \subseteq \delta \times X(\mathcal{T})$ is the constraint relation.

A tuple $(q_i, q_j, \sigma) \in \delta$ indicates that a symbol $\sigma$ yields a transition from state $q_i$ to state $q_j$, subject
to the restrictions specified by the clock constraints in $\eta$. A tuple $(q_i, q_j, \sigma, \bar{T}) \in \gamma$ indicates that on
the transition on symbol $\sigma$ from $q_i$ to $q_j$, all clocks in $\bar{T}$ are to be initialized to 0. Finally, a tuple
$(q_i, q_j, \sigma, \chi(\mathcal{T})) \in \eta$ indicates that the transition on $\sigma$ from $q_i$ to $q_j$ can only be taken if the constraint
$\chi(\mathcal{T})$ evaluates to true under the values of the current timer interpretation.

We define paths, runs, and subruns over a TBA analogously to those over a TFA:

Definition 6 (Path (TBA)). Let $\mathcal{A}$ be a TBA with state set $Q$ and transition relation $\delta$. $(q_1, \ldots, q_n)$ is a
path over $\mathcal{A}$ if, for all $1 \le i < n$, $\exists \sigma.\ (q_i, q_{i+1}, \sigma) \in \delta$.

Definition 7 (Run, Subrun (TBA)). A run (subrun) $r$, denoted by $(\bar{q}, \bar{\nu})$, of a Timed Büchi Automaton
$\langle \Sigma, Q, q_0, F, \mathcal{T}, \delta, \gamma, \eta \rangle$ over a timed word $(\sigma, \tau)$ is an infinite (finite) sequence of the form

$$r : (q_0, \nu_0) \xrightarrow[\tau_1]{\sigma_1} (q_1, \nu_1) \xrightarrow[\tau_2]{\sigma_2} (q_2, \nu_2) \xrightarrow[\tau_3]{\sigma_3} \cdots$$

satisfying the same requirements as given in Definition 5.

For a run $r$, the set $\mathrm{inf}(r)$ denotes the set of states which are visited infinitely many times. A TBA
$\mathcal{A}$ with final states $F$ accepts a timed word $w = (\sigma, \tau)$ if $\mathrm{inf}(r) \cap F \neq \emptyset$, where $r$ is the run of $w$ on $\mathcal{A}$.
That is, a TBA accepts its input if any of the states from $F$ repeat an infinite number of times in $r$.
Example 8. Consider the following TBA with start state $q_1$ and accept states $F = \{q_2\}$:

[Figure: TBA with edge labels a; T = 0, b, and c; (T < 50)?.]

This TBA accepts the ω-language $L_1 = \{((ab^*c)^\omega, \tau) \mid \forall x\, \exists i, j.\ \forall k.\ \phi\}$, where $\phi$ is the boolean formula

[formula missing]

Lastly, we take the concept of maximum delay, introduced in the previous section with respect to
Timed Finite Automata, and extend it to apply to Timed Büchi Automata. Doing so first requires the
following definition, which allows us to restrict the timing analysis for TBAs to finite subwords:

Definition 9 (Subword over $\bar{q}$). Let $\mathcal{A}$ be a TBA, and let $\bar{q} = (q_m \ldots q_n)$ be a finite path over $\mathcal{A}$. A finite
timed word $w = ((\sigma_m \ldots \sigma_n), (\tau_m \ldots \tau_n))$ is a subword over $\bar{q}$ iff $\exists q_0, \ldots, q_{m-1}, \sigma_0, \ldots, \sigma_{m-1}, \tau_0, \ldots, \tau_{m-1}$ such
that $(q_0 \ldots q_{m-1} q_m \ldots q_n, \bar{\nu})$ is a subrun of $((\sigma_0 \ldots \sigma_{m-1} \sigma_m \ldots \sigma_n), (\tau_0 \ldots \tau_n))$ over $\mathcal{A}$ for some $\bar{\nu}$.

Definition 9 is a technicality which is necessary to support the following definition of the maximum
delay between states of a TBA:

Definition 10. Let $\mathcal{A}$ be a TBA, and let $\bar{q}$ be a finite path over $\mathcal{A}$. Then $\Delta_{\mathcal{A}}(\bar{q})$ is the maximum duration
of any subword over $\bar{q}$.

Example 11. Consider $\mathcal{A}_1$ from Example 8. Then $\Delta_{\mathcal{A}_1}(q_1 q_2 q_2 q_1) = 50$.

Algorithmically computing $\Delta_{\mathcal{A}}(\bar{q})$ for a TBA $\mathcal{A}$ is analogous to the case for TFAs; in small cases
(i.e., relatively few timers with small time constraints), the analysis is relatively simple, while we conjecture
the problem for more complex cases to be intractable; we leave more detailed analysis for future
work.

2.3 Parallel Timing Systems 

Next, we model the timing properties of an SPMD-type parallel system as a whole by combining the two
models of Sections 2.1 and 2.2 into a single parallel timing system. A parallel timing system (PTS) is a
tuple $\langle P, \mathbb{A}, \psi, \varphi \rangle$, where

• $P = \langle \Sigma, Q, q_0, F, \mathcal{T}, \delta, \gamma, \eta \rangle$ is a TBA (used to model the timing properties of the parent process),

• $\mathbb{A}$ is a set $\{A_1, \ldots, A_n\}$ of TFAs (used to model the timing properties of the child processes),

• $\psi \subseteq \delta \times \mathbb{A}$ is a fork relation (used to model the spawning of child processes), and

• $\varphi \subseteq \delta \times \mathbb{A}$ is a join relation (used to model barriers (joins)).

A tuple $(q_i, q_j, \sigma, A)$ in $\psi$, with $A \in \mathbb{A}$, indicates that an instance of $A$ is to be "forked" on the transition
from $q_i$ to $q_j$ on symbol $\sigma$, and this "fork" is denoted graphically as $q_i \xrightarrow{\Psi(A)} q_j$, modeling the spawning
of a child process along the transition. Similarly, a tuple $(q_i, q_j, \sigma, A)$ in $\varphi$ indicates that a previously
forked instance of $A$ is to be "joined" on the transition from $q_i$ to $q_j$ on symbol $\sigma$. This "join" is denoted
graphically as $q_i \xrightarrow{\Omega(A)} q_j$, modeling the joining along the transition with a previously spawned child
process.^1

^1 $\Psi$ was chosen as the symbol for 'fork', as it graphically resembles a "fork"; $\Omega$ was chosen as that for 'join', as it connotes
"ending" or "finality".

Example 12. Consider the following timing system $S_1 = \langle P, \{A\}, \psi, \varphi \rangle$:

[Figure: the parent TBA $P$ (edge c; (T < 50)?, fork $\Psi(A)$) and the child TFA $A$ (edges 0; U = 0 and 1; (U < 10)?).]

$P$ is the parent TBA with initial state $q_1$ and final state set $F = \{q_2\}$. $P$ accepts the ω-language $L_1$
(see Example 8), and $A$ is a TFA which accepts the timed language $\{(01, \tau_1 \tau_2) \mid \tau_2 - \tau_1 < 10\}$. In addition, the
fork and join relations $\psi$ and $\varphi$ dictate that on the transition from $q_1$ to $q_2$, an instance of $A$ is forked
($\Psi(A)$), and that the transition from $q_2$ to $q_1$ can only proceed once that instance of $A$ has completed
($\Omega(A)$).

Conceptually, this system models a parent process (P) which exhibits periodic behavior, accepting 
an infinite number of substrings of the form ab*c, in which the initial 'a' triggers a child process A which 
must be completed prior to the end of the sequence, marked by the following 'c'. In addition, the 'c' must 
occur no more than 50 time units after the initial 'a '. The child process is modeled by A, which accepts 
strings of the form 01, in which the 1 must occur no more than ten time units after the initial 0. 

In theory, child processes could spawn children of their own (e.g. recursion). For now, however, 
we disallow this possibility, as it somewhat complicates the analysis in the following section without 
adding significantly to the expressive power of the model. The model can be expanded later to allow for 
arbitrarily nested children of children with the appropriate modifications; specifically, TBAs would need 
to be extended to include their own $\psi$ and $\varphi$ relations, as would the definition of $\Delta$ for TBAs.

Before proceeding, it is important to note that a PTS $S = \langle P, \mathbb{A}, \psi, \varphi \rangle$ is not itself interpreted as an
automaton. In particular, we do not ever define a language accepted by $S$. Indeed, it is not entirely clear
what such a language would be, as we never specify the input to any of the children in $\mathbb{A}$. Rather, the
sole intent in specifying such a system S is to specify the timing behavior of the overall system, rather 
than any particular language that would be accepted by it. 



2.3.1 Consistency 



With this said, we note that in Example 12, $A$ is in some sense "consistent" with its usage in $P$. Specifically,
since the maximum duration of any string accepted by $A$ is 10, we are guaranteed that any instance
of $A$ forked on the $q_1 \to q_2$ transition will have completed in time for the 'join' along the $q_2 \to q_1$ transition
and hence, the timer $(T < 50)?$ on this transition would be respected in all cases. In this sense,
all $(\Psi(A), \Omega(A))$ pairs are consistent with timer $T$. However, such consistency is not always the case.
Consider, for instance, the parallel timing system $S_2$ shown in Figure 3.

[Figure 3 edge labels. Parent: a; T = 0 with $\Psi(A)$; $\Omega(A)$; $\Psi(B)$; and c; (T < 25)? with $\Omega(B)$. Child $A$: 0; U = 0 and 1; (U < 10)?. Child $B$: 0; V = 0 and 1; (V < 20)?.]

Figure 3: An inconsistent parallel timing system $S_2$.

In this case, there are two child processes: $A$ and $B$. The maximum duration of a timed word accepted by $A$ is 10, and that of $B$ is 20.
Supposing that an 'a' occurs (and $A$ is forked) at time 0, it is thus possible that $A$ will not complete
until time $10 - \epsilon_1$, at which time the 'b' and fork of $B$ can proceed. It is therefore possible that $B$ will not
complete until time $30 - \epsilon_1 - \epsilon_2$ (for small $\epsilon_1, \epsilon_2$). This would then violate the $(T < 25)?$ constraint,
corresponding to a case in which a child process could take longer to complete than is allowable, given the
timing constraints of the parent process. It is precisely this type of interference which we must disallow
in order for a timing system to be considered consistent with itself.
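The arithmetic behind this violation can be written out explicitly. The symbols $\tau_a$, $\tau_b$, and $\tau_c$ below are names introduced here for the timestamps of 'a', 'b', and 'c' in the parent; they do not appear in the original notation, and the sketch assumes that 'b' fires as soon as $A$ completes and 'c' as soon as $B$ completes.

```latex
\begin{align*}
\tau_b - \tau_a &< \Delta_A = 10, \\
\tau_c - \tau_b &< \Delta_B = 20, \\
\tau_c - \tau_a &= (\tau_c - \tau_b) + (\tau_b - \tau_a) < 30 .
\end{align*}
```

Any value of $\tau_c - \tau_a$ in the interval $[25, 30)$ is therefore permitted by the children yet rejected by the parent's $(T < 25)?$ constraint.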

To this end, we propose a method of defining consistency within a timing system. Informally, we 
take the approach of deriving a new set of conditions from the timing constraints of the child processes, 
so that checking consistency reduces to the process of verifying that these conditions respect the timing 
constraints of the master process. 

First, we replace $\mathbb{A}$, $\psi$, and $\varphi$ from the parallel timing system with a new set of derived timers, one
for each $A \in \mathbb{A}$, defining the possible "worst case" behavior of the child processes. Each such timer $T_A$
is initialized on the transition along which the corresponding $A$ is forked, and is used along (constrains)
any transitions along which $A$ is joined. Each such use ensures that the timer is less than $\Delta_A$, representing
the fact that the elapsed time between the forking and joining of a child process is bounded in the worst
case by $\Delta_A$, the longest possible duration for the child process. As an example, "flattening" the timing
system $S_1$ of Example 12 results in a single new timer $T_A$, initialized along the $q_1 \to q_2$ transition, and
used along the $q_2 \to q_1$ transition with the constraint $(T_A < 10)?$. We then check that none of these new
derived timers invalidate the timing constraints of the parent process.

Formally, we define two relations. The first of these is flattening, which takes a parallel timing system
$\langle P, \mathbb{A}, \psi, \varphi \rangle$ and yields a new pair of relations $(\gamma', \eta')$. Intuitively, $\gamma'$ defines the edges along which each
of the derived timers are initialized, and $\eta'$ defines the edges along which each of the derived timers are
used:

Definition 13. Let $S = \langle P, \mathbb{A}, \psi, \varphi \rangle$ be a parallel timing system. Then $\mathrm{flatten}(S) = (\gamma', \eta')$, where

$$\gamma' = \{(q_i, q_j, \sigma, \{T_A\}) \mid (q_i, q_j, \sigma, A) \in \psi\}$$
$$\eta' = \{(q_i, q_j, \sigma, \chi) \mid (q_i, q_j, \sigma, A) \in \varphi\}$$

and

$$\chi = \bigwedge_{(q_i, q_j, \sigma, A) \in \varphi} (T_A < \Delta_A)$$

The second relation takes $\psi$ and $\varphi$ as inputs and extracts a set of edge pairs, defined such that each
such pair $(e_1, e_2)$ specifies when a derived timer is initialized ($e_1$) and used ($e_2$).

Definition 14. Let $S = \langle P, \mathbb{A}, \psi, \varphi \rangle$ be a parallel timing system, with $A \in \mathbb{A}$. Then the set of all use pairs
of $A$ in $S$ is defined as

$$\mathrm{pairs}(A, S) = \{((q_x, q_y), (q_m, q_n)) \mid ((q_x, q_y, \sigma_1, A) \in \psi) \wedge ((q_m, q_n, \sigma_2, A) \in \varphi)\}$$

for some $\sigma_1, \sigma_2$. Furthermore,

$$\mathrm{pairs}(S) = \bigcup_{A \in \mathbb{A}} \mathrm{pairs}(A, S)$$

Example 15. Consider parallel timing system $S_3$ shown in Figure 4. Observe that $\Delta_A = 25$ and $\Delta_B = 11$.
Then $\mathrm{flatten}(S_3) = (\gamma', \eta')$, where

$$\gamma' = \{(q_1, q_2, a, \{T_A\}), (q_2, q_3, b, \{T_B\})\}$$
$$\eta' = \{(q_3, q_1, c, \chi)\}, \text{ where } \chi = (T_A < 25) \wedge (T_B < 11)$$

shown graphically in Figure 5, and

$$\mathrm{pairs}(S_3) = \mathrm{pairs}(A, S_3) \cup \mathrm{pairs}(B, S_3)$$
$$= \{((q_1, q_2), (q_3, q_1))\} \cup \{((q_2, q_3), (q_3, q_1))\}$$
$$= \{((q_1, q_2), (q_3, q_1)), ((q_2, q_3), (q_3, q_1))\}$$

Figure 4: Parallel timing system $S_3$. $\Delta_A = 25$, $\Delta_B = 11$.

Figure 5: The result of flattening $S_3$: forks and joins of $A$ and $B$ are shown along with derived timers $T_A$
and $T_B$. Compare with Figure 4.
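The flattening and pairing relations can be sketched in a few lines of Python. The encoding below is an assumption made for illustration: edges are tuples of strings, a child TFA is identified by its name, a derived timer for child `A` is named `"T_A"`, and a conjunction of upper bounds is a frozenset of (timer, bound) pairs. The fork and join data are those implied by Example 15.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str, str]           # (q_i, q_j, symbol)
Tagged = Tuple[str, str, str, str]    # an edge plus the name of a child TFA

def flatten(psi: Set[Tagged], phi: Set[Tagged], delta: Dict[str, float]):
    """Return (gamma', eta') for fork relation psi and join relation phi."""
    gamma = {(qi, qj, s, frozenset({"T_" + a})) for qi, qj, s, a in psi}
    joined: Dict[Edge, Set[str]] = {}
    for qi, qj, s, a in phi:
        joined.setdefault((qi, qj, s), set()).add(a)
    # Each joined edge carries the conjunction of (T_A < Delta_A) over the
    # children joined on it, encoded as a frozenset of (timer, bound) pairs.
    eta = {(qi, qj, s, frozenset(("T_" + a, delta[a]) for a in kids))
           for (qi, qj, s), kids in joined.items()}
    return gamma, eta

def pairs(psi: Set[Tagged], phi: Set[Tagged]):
    """All (fork edge, join edge) pairs that refer to the same child."""
    return {((fx, fy), (jm, jn))
            for fx, fy, _, a in psi
            for jm, jn, _, b in phi if a == b}

# S_3 as implied by Example 15: A forked on a, B forked on b, both joined on c.
PSI = {("q1", "q2", "a", "A"), ("q2", "q3", "b", "B")}
PHI = {("q3", "q1", "c", "A"), ("q3", "q1", "c", "B")}
GAMMA_PRIME, ETA_PRIME = flatten(PSI, PHI, {"A": 25, "B": 11})
USE_PAIRS = pairs(PSI, PHI)
```

With this encoding, `ETA_PRIME` contains a single entry on the edge from `q3` to `q1` whose constraint set holds both `("T_A", 25)` and `("T_B", 11)`, mirroring the conjunction in the example.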

We can now proceed with a formal definition of consistency for a parallel timing system. Recall that 
intuitively, such a system is consistent if the worst case timing scenarios over all child processes will not 
invalidate the timing constraints of the parent process — in other words, if the maximum delay between 
two states allowed by the child processes never exceeds the corresponding maximum delay allowed by 
the timers in the parent process. 

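Definition 16 (Consistency). Let $S = \langle \mathcal{A}, \mathbb{A}, \psi, \varphi \rangle$ be a PTS, where

• $\mathcal{A} = \langle \Sigma, Q, q_0, F, \mathcal{T}, \delta, \gamma, \eta \rangle$ is a TBA,

• $\mathrm{flatten}(S) = (\gamma', \eta')$, and

• $\mathcal{A}' = \langle \Sigma, Q, q_0, F, \mathcal{T}, \delta, \gamma', \eta' \rangle$.

Then $S$ is consistent if for all edge pairs $((q_x, q_y), (q_m, q_n)) \in \mathrm{pairs}(S)$ and all paths $p = q_x q_y \ldots q_m q_n$
through $\mathcal{A}$,

$$\Delta_{\mathcal{A}'}(p) \le \Delta_{\mathcal{A}}(p) \qquad (16.1)$$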



We conclude this section with a few simple examples, which should help to clarify Definition 16; the
following section gives a more realistic example.

Example 17. $S_1$ is consistent.

Proof. $P'$, the result of flattening $S_1$, is shown below, with $T_A$ being the derived timer corresponding to
$A$:

[Figure: $P'$ with edge labels a; T = 0, $T_A = 0$; b; and c; (T < 50)?, $(T_A < 10)?$.]

Furthermore, $\mathrm{pairs}(P') = \{((q_1, q_2), (q_2, q_1))\}$, and by observation, all paths through $P'$ beginning
with the edge $(q_1, q_2)$ and ending with the edge $(q_2, q_1)$ take the form $q_1 (q_2)^* q_1$. All such paths $p$ satisfy
inequality 16.1, and thus by definition, $S_1$ is consistent. □

Example 18. $S_3$ is not consistent.

Proof. $P'$, the result of flattening $S_3$, is shown in Figure 5. Furthermore,

$$\mathrm{pairs}(P') = \{((q_1, q_2), (q_3, q_1)), ((q_2, q_3), (q_3, q_1))\}$$

There are thus two paths against which we need to test inequality 16.1: $(q_1 q_2 q_3 q_1)$ and $(q_2 q_3 q_1)$. The
first of these fails the test: $\Delta_{P'}(q_1 q_2 q_3 q_1) = 25$, and $\Delta_P(q_1 q_2 q_3 q_1) = 24$.

□
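The test of inequality 16.1 on a single path can be sketched in Python for the two systems whose parent constraints are fully stated in the text, $S_1$ (Example 12) and $S_2$ (Figure 3). This is an illustrative approach rather than the procedure used in this paper: each timer is encoded as a hypothetical triple (reset event, check event, bound) over the events of the path, strictness of the bounds is ignored so that the computed values are suprema, and the function names are assumptions.

```python
from typing import List, Tuple

TimerUse = Tuple[int, int, float]   # (reset event index, check event index, bound)

def sup_delay(num_events: int, uses: List[TimerUse]) -> float:
    """Supremum of (last timestamp - first timestamp) allowed by the timers."""
    dist = [float("inf")] * num_events
    dist[0] = 0.0
    # Timer edges reset -> check, plus monotonicity edges k+1 -> k of weight 0.
    edges = list(uses) + [(k + 1, k, 0.0) for k in range(num_events - 1)]
    for _ in range(num_events - 1):           # Bellman-Ford relaxation passes
        for src, dst, weight in edges:
            dist[dst] = min(dist[dst], dist[src] + weight)
    return dist[-1]

def satisfies_16_1(num_events: int, parent_uses: List[TimerUse],
                   derived_uses: List[TimerUse]) -> bool:
    """Check Delta_{A'}(p) <= Delta_A(p) for one path p."""
    return sup_delay(num_events, derived_uses) <= sup_delay(num_events, parent_uses)

# S_1, path q1 q2 q1 (events a, c): parent T < 50, derived T_A < 10.
s1_ok = satisfies_16_1(2, [(0, 1, 50.0)], [(0, 1, 10.0)])

# S_2, path q1 q2 q3 q1 (events a, b, c): parent T < 25 from a to c;
# derived T_A < 10 from a to b and T_B < 20 from b to c.
s2_ok = satisfies_16_1(3, [(0, 2, 25.0)], [(0, 1, 10.0), (1, 2, 20.0)])
```

Under these assumptions `s1_ok` is true and `s2_ok` is false, in agreement with the discussion of $S_1$ and $S_2$ in Section 2.3.1.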



3 Case Study: Matrix Multiplication 

We now turn our attention to a practical application of the concepts discussed so far. Namely, we demonstrate
the use of the formal validation concepts on a simple parallel, MPI-style [13, 15] matrix multiplication
kernel, extracted from the larger power-grid analysis application described in [11, 12]. Our kernel
implements a variant of Fox's algorithm for matrix multiplication Q. For simplicity, we assume square 
matrices, and that the number of columns, rows, and processors are all perfect squares. The algorithm 
distributes the task of multiplying two matrices amongst all processors in the system. 

We give a simple distributed algorithm for matrix multiplication, and a consistent parallel timing 
system for that algorithm. We conclude the section with empirical results — timing measurements taken 
on a small, four-node real-time cluster, each node consisting of dual quad-core 2.66 GHz Xeon X5660
processors running the Xenomai RTOS with 48 GB RAM. The timing measurements of the PTS, along
with the usual restrictions associated with real-time computation (e.g. no virtual memory or paging,
process scheduling, ensuring minimal variance in execution timings, etc.), are bounded by virtue of
Xenomai's real-time process scheduler. The result is a matrix multiplication kernel which provably runs
in under 9 ms per cycle for 128 × 128 double-precision matrices. We emphasize that we are not claiming
the speed of the operation to be a groundbreaking result — obviously, this is a relatively small matrix 
size, but was so chosen as this is the order of the size required by our targeted application kernel. Rather, 
we give these numbers, as well as the PTS, to illustrate the process by which we analyze the temporal 
interactions between processes, thus showing this delay to be a provable upper bound. 

3.1 Algorithm 



Algorithm 1 MatrixMultiply: Compute C = A × B

p : Number of processors
N : Rank of matrices

 1: q ← √p
 2: while true do
 3:   dest ← 1
 4:   if self == 0 then {Master process}
 5:     for i = 0 to q − 1 do
 6:       for j = 0 to q − 1 do
 7:         w ← i·N/q, x ← (i+1)·N/q
 8:         y ← j·N/q, z ← (j+1)·N/q
 9:         X ← A[w : x][0 : N]
10:         Y ← B[0 : N][y : z]
11:         if i ≠ 0 and j ≠ 0 then {Master already has these chunks}
12:           send(X, dest)
13:           send(Y, dest)
14:           dest ← dest + 1
15:         end if
16:       end for
17:     end for
18:   else {Child processes}
19:     X ← recv(0)
20:     Y ← recv(0)
21:   end if
22:   Z ← locMM(X, Y)
23:   reduce(Z, C)
24: end while



The pseudocode for the algorithm is given in Algorithm 1. Conceptually, to multiply two N x N
matrices A and B using a p processor cluster, each matrix is divided into segments, which are then 
distributed in round-robin fashion amongst the processors of the cluster. Each processor then performs 
a local matrix multiplication on its own local submatrices, and the results of these local operations are 
aggregated (reduced) to form the matrix product A x B. 

Due to space constraints, we will not describe the partitioning in detail; Figure 6 shows the
partitioning and distribution of work by Algorithm 1 for a four-processor cluster. In this figure,
A, B, and C are all N x N matrices. A is partitioned into 2 sets of N/2 rows each, and B is
partitioned into 2 sets of N/2 columns each. The master process, p0, computes the local product
A1 x B1, and writes the result to C1. p0 then sends submatrices A1 and B2 to p1, which then computes
their product, writing the result to C2. Similarly, p2 receives and computes C3 = A2 x B1, and p3
receives and computes C4 = A2 x B2.
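
The partitioning described here can be sketched in a few lines of Python. This is a single-process illustration only, under stated assumptions: there is no message passing, the name partition_multiply and the owners map are hypothetical, and block (i, j) is assumed to go to processors in the row-major order implied by the four-processor example (p0, p1, p2, p3). It is not our cluster implementation.

```python
import math


def partition_multiply(A, B, p):
    """Multiply square matrices A and B by splitting A into row bands and
    B into column bands, simulating in one process what p processors would do."""
    n = len(A)
    q = math.isqrt(p)
    assert q * q == p, "p is assumed to be a perfect square"
    assert n % q == 0, "N is assumed to be divisible by sqrt(p)"
    h = n // q  # rows per band of A, columns per band of B

    C = [[0] * n for _ in range(n)]
    owners = {}  # hypothetical map: block (i, j) -> rank that would compute it
    rank = 0
    for i in range(q):
        for j in range(q):
            X = A[i * h:(i + 1) * h]                    # h rows of A, all columns
            Y = [row[j * h:(j + 1) * h] for row in B]   # all rows of B, h columns
            # Local product Z = X * Y (an h x h block).
            Z = [[sum(X[r][k] * Y[k][c] for k in range(n)) for c in range(h)]
                 for r in range(h)]
            # Stand-in for the reduce step: write Z into its block of C.
            for r in range(h):
                C[i * h + r][j * h:(j + 1) * h] = Z[r]
            owners[(i, j)] = rank
            rank += 1
    return C, owners
```

With p = 4, the owners map assigns A1 x B1 to rank 0, A1 x B2 to rank 1, A2 x B1 to rank 2, and A2 x B2 to rank 3, matching the assignment described for Figure 6.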

The algorithm proceeds as follows: the master process executes lines 5 through 17, which partition
A and B into submatrices (lines 7-10) and send these parts out to the respective child processes
(lines 12-13). Conversely, the child processes execute lines 19-20, which receive the submatrices
assigned by the master process. Lines 22-23 are run by all processes, including the master process
(which, in this case, participates in the task of matrix multiplication as well). Line 22 performs
the local operation, and line 23 writes the local result to the appropriate location in C. The
entire process then repeats indefinitely, as given by the while loop (lines 2 and 24).



Figure 6: Partitioning and distribution of matrix multiplication by Algorithm 1 across a four-processor
cluster. (Panels: A, B, and C, with C divided into blocks C1, C2, C3, and C4.)



3.2 Parallel Timing System 

Figure 7 shows a parallel timing system for Algorithm 1 across a four-processor cluster, consisting of
the TBA P_MM, which models the master process, and a child TFA A_MM, modeling instances of the child
processes. Specific events have been elided from the diagram, since events in this case always
represent transitions between statements.

3.2.1 Parent 

States in the parent automaton P are prefixed with a 'P', followed by the line number as given in
Algorithm 1. For example, P[3] corresponds to the state of the parent process as it is executing line 3.

Additionally, lines 12 and 13 each beget three separate states, parameterized on the values of the
loop induction variables i and j, and are labeled accordingly. As is commonly the case in WCET
analysis, unrolling the loop nest in this fashion is necessary in order to obtain a strict upper bound on the
number of iterations and, consequently, the total execution time, of the loop nest.
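
As a rough illustration of this unrolling, the following Python sketch enumerates the states generated by the send statements for a given q. The function name and the label format P[line](i=..,j=..) are hypothetical conventions chosen for this sketch; the only facts taken from the text are that the iteration i = j = 0 sends nothing and that lines 12 and 13 yield one state per remaining iteration.

```python
def unrolled_send_states(q, lines=(12, 13)):
    """Enumerate one parent state per send line and per (i, j) iteration
    that actually sends, in loop-nest order."""
    states = []
    for i in range(q):
        for j in range(q):
            if (i, j) == (0, 0):
                continue  # the master keeps this chunk, so nothing is sent
            for line in lines:
                states.append(f"P[{line}](i={i},j={j})")
    return states
```

For q = 2 (four processors) this yields three states for line 12 and three for line 13; because the iteration count is fixed once q is known, the number of timed transitions is fixed as well.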

P forms, in this case, a simple cycle. The cycle starts at state P[3] and steps sequentially through the
steps (states) of the algorithm. Namely, the parent process starts at line 3 (i.e., state P[3]), and
proceeds sequentially through line 4 (state P[4]), and eventually to line 12 (state P[12](i=0,j=1)).
The delay between the initialization (state P[3]) and the first send (P[12](i=0,j=1)) is bounded by a
timer, T_setup1 (the idea being that this is the delay incurred by the time to "set up" the first
send). Execution then proceeds to line 13 (P[13](i=0,j=1)): the delay along this transition represents
the time to send the first chunk to the respective child process, and is bounded by timer T_send1. At
this point, execution proceeds to line 14 (P[14](i=0,j=1)). Along this transition, there are two items
to note: first, the time to process the second send is bounded by the timer T_send2, and second, the
child process has now been sent the data it needs, and consequently, A_1 is forked. Execution
proceeds similarly through the next six states, representing the unwound iterations of the loop nest.
Child process A_2 is similarly forked on the transition from P[13](i=1,j=0) to P[14](i=1,j=0), and A_3
on the transition from P[13](i=1,j=1) to P[14](i=1,j=1). Execution then proceeds through lines 22
(state P[22]) and 23 (state P[23]). The duration of the local matrix multiplication operation (line 22)
is bounded by the timer T_MM, and that through the reduce operation (line 23) by the timer T_reduce.
Additionally, the transition from P[23] back to P[3] waits for (joins with) all child processes to
complete before proceeding.

3.2.2 Child 

In this case, the child processes are modeled by the TFA A. Nomenclature is analogous to that of P:
states in A are prefixed with an A, followed by the corresponding line number from Algorithm 1.

The child process starts at line 18 (state A[18]). The process then proceeds to receive the first block
of data (line 19, state A[19]). The time to process the receive is bounded by timer T_recv1. Execution
proceeds to receive the second block of data (line 20, state A[20]). The time to process this second
receive is bounded by T_recv2. Execution proceeds next to the local matrix multiplication (line 22,
state A[22]): the time spent on this operation is bounded by timer T_locmm. Finally, execution proceeds
to the data writeback (line 23, state A[23]): the time spent on this operation is bounded by timer
T_reduce.

Theorem 19. S_MM is consistent.

Proof. Let flatten(S_MM) = (γ', η'), with P'_MM = (Σ, Q, q_0, F, T, δ, γ', η'). By definition,

    pairs(S_MM) = {(e_1, e_4), (e_2, e_4), (e_3, e_4)},

where

    e_1 = (P[13](i=0,j=1), P[14](i=0,j=1))        e_2 = (P[13](i=1,j=0), P[14](i=1,j=0))
    e_3 = (P[13](i=1,j=1), P[14](i=1,j=1))        e_4 = (P[23], P[3])

By observation, there are three paths which we must consider:

    p_1 = (P[13](i=0,j=1), P[14](i=0,j=1), ..., P[23], P[3])
    p_2 = (P[13](i=1,j=0), P[14](i=1,j=0), ..., P[23], P[3])
    p_3 = (P[13](i=1,j=1), P[14](i=1,j=1), ..., P[23], P[3])

The rest of the proof follows by enumeration:

    (Δ_{A_MM}(·) = 6.1) < (Δ_{P_MM}(p_1) = 8.1)
    (Δ_{A_MM}(·) = 6.1) < (Δ_{P_MM}(p_2) = 7.3)
    (Δ_{A_MM}(·) = 6.1) < (Δ_{P_MM}(p_3) = 6.5)

□
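
As a minimal sketch of the enumeration step, the following Python fragment compares the child delay bound (6.1 ms) against the parent delay bound along each of the three paths (8.1, 7.3, and 6.5 ms), as quoted in the proof. The function and variable names are hypothetical, and reading the check as one strict inequality per path is an assumption drawn only from the three comparisons above; it is not a restatement of the formal definition of consistency.

```python
# Delay bounds in milliseconds, taken from the enumeration in the proof.
CHILD_BOUND_MS = 6.1
PARENT_PATH_BOUNDS_MS = {"p1": 8.1, "p2": 7.3, "p3": 6.5}


def check_paths(child_bound, parent_bounds):
    """Return, per parent path, whether the child bound is strictly smaller."""
    return {name: child_bound < bound for name, bound in parent_bounds.items()}


def slack(child_bound, parent_bounds):
    """Return, per parent path, the margin (ms) between parent and child bounds."""
    return {name: round(bound - child_bound, 3)
            for name, bound in parent_bounds.items()}


if __name__ == "__main__":
    print(check_paths(CHILD_BOUND_MS, PARENT_PATH_BOUNDS_MS))
    print(slack(CHILD_BOUND_MS, PARENT_PATH_BOUNDS_MS))
```

The tightest margin is along the last path, whose child is forked latest and therefore has the least time before the join.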

Finally, we note that the worst case delay along one iteration of the algorithm is 8.9 ms. This follows 
from the observation that the parent automaton P takes the form of a simple cycle with no unbound 
segments (i.e., subpaths which are not constrained by any timer). Specifically, P consists of consecutive 
pairs of segments, each constrained by pairs of timers. Consequently, we can derive an upper bound 
for a single iteration of the algorithm by summing the bounds of all of the timers, yielding the specified 
upper bound. Combined with Theorem 19, which ensures that the timing of the child processes does not
invalidate this bound, we are left with a cyclic, parallel, time-bounded matrix multiplication kernel. 
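
The summation argument can be sketched as follows. The segment names and numeric values in this Python fragment are placeholders chosen purely for illustration; they are not the timer bounds of Figure 7 and do not reproduce the 8.9 ms figure. The sketch assumes only what is stated above: the parent automaton is a simple cycle, and every segment of it must be constrained by a timer for the sum to be a valid bound.

```python
def cycle_upper_bound(segments):
    """Sum per-segment timer bounds (ms) around a simple cycle.

    `segments` is a list of (name, bound_ms) pairs; a bound of None marks a
    segment that no timer constrains, in which case no finite bound exists.
    """
    unbound = [name for name, bound in segments if bound is None]
    if unbound:
        raise ValueError(f"unbound segments: {unbound}")
    return sum(bound for _, bound in segments)


# Placeholder values only (hypothetical, not measurements from our cluster).
PLACEHOLDER_SEGMENTS = [
    ("setup", 1.0),
    ("send_first", 0.5),
    ("send_second", 0.5),
    ("local_multiply", 2.0),
    ("reduce", 0.25),
]
```

Calling cycle_upper_bound(PLACEHOLDER_SEGMENTS) returns the sum of the placeholder bounds; replacing any bound with None raises an error, mirroring the requirement that the cycle contain no unbound segments.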

Figure 7: Parallel timing system S_MM for Algorithm 1 across a four-processor cluster. Events have been
elided for the sake of clarity. Upper bounds on timer constraints correspond to delay measurements
taken over our implementation; times are given in milliseconds. Minimal variance from these bounds is
ensured to the extent provided by the underlying RTOS.

4 Concluding Remarks

We conclude with a few closing remarks. We have presented a formal system for modeling the temporal 
properties of a restricted class of real-time parallel systems, with a simple example of an application 
kernel. As is usually the case with real-time systems, loops need to be unrolled, bounding the number 
of iterations, in order to obtain an upper bound on the total execution time of the loop. Algorithm 1
(intentionally) distills to a relatively simple PTS, due to the basic structure of the control flow graph of
both the parent and child processes; more complex examples are of obvious interest for future work.
Similarly, the model in Figure 7 was derived manually, which in this case was a relatively simple
task. More complex examples can certainly prove to be more of a challenge, and automated tools for this
task are desirable. One possible approach for such automation would be compiler-driven, whereby users 
could specify to the compiler (via #pragmas, for instance), events of interest, and the compiler could 
proceed to output the appropriate annotated control flow graph. 

We assume timing behavior is consistent across all child processes, although if there were to be sig- 
nificant variance across child processes (e.g. heterogeneous or NUMA architectures) we could account 
for such behavior using different child TFAs. 

Additionally, we have laid out several interesting open questions which arise out of the analysis of 
our relatively straightforward formulation: what is the complexity of computing the worst case delay 
along a single path of a TFA (TBA), and through a TFA (TBA) in general? Up to this point, we have 
only considered conjunctions of maximum constraints; how does this change in the presence of a more 
generalized constraint syntax (c.f. |Q])? 

We have largely been working with the SPMD execution model paradigmatic of many MPI-type 
programs. It would be interesting to investigate temporal models for other parallel models (e.g. OpenMP) 
as well. Lastly, our application kernel distills to a relatively simple set of automata. More complex 
examples are certainly of interest, and are on the horizon for future work. 